Data description

This dataset contains a categorical variable, and the task is to examine models for a categorical response.

palmerpenguins is a new R data package, with interesting measurements on penguins of three different species. Subset the data to contain just the Adelie and Chinstrap species, and only the variables species and the four physical size measurement variables. Use standardised variables for answering all of the questions.
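A minimal sketch of this preparation step, assuming the Adelie and Chinstrap pairing that the rest of this report analyses. The short names bl, bd, fl and bm match the model output further down. The object name p_sub and the helper std() are illustrative, and dropping incomplete rows with na.omit() is an assumption.

```r
library(palmerpenguins)
library(dplyr)

# Hypothetical helper: centre to mean 0 and scale to standard deviation 1.
std <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

p_sub <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>%
  select(species,
         bl = bill_length_mm, bd = bill_depth_mm,
         fl = flipper_length_mm, bm = body_mass_g) %>%
  na.omit() %>%                          # assumption: drop rows with missing measurements
  mutate(across(bl:bm, std),             # standardise the four size variables
         species = droplevels(species))  # remove the unused Gentoo level

summary(p_sub)
```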

Exploratory analysis

Make a scatterplot matrix of the data, with species mapped to colour. Which variable(s) would you expect to be the most important for distinguishing between the species?

The distributions of the two species overlap on every variable except bl, which makes the other variables poor for telling the groups apart. bl is therefore expected to be the most important variable for distinguishing between the species.
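One way such a scatterplot matrix might be drawn is with ggscatmat() from the GGally package; the choice of package is an assumption, since the report does not say which function was used. Standardising the variables would not change the shape of the plot, so it is skipped here, and p_plot is an illustrative name.

```r
library(palmerpenguins)
library(dplyr)
library(GGally)

p_plot <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>%
  select(species, bl = bill_length_mm, bd = bill_depth_mm,
         fl = flipper_length_mm, bm = body_mass_g) %>%
  na.omit() %>%
  as.data.frame()

# Columns 2 to 5 are the four measurements; colour is mapped to species.
scat <- ggscatmat(p_plot, columns = 2:5, color = "species")
scat
```

The density curves on the diagonal are the place to look for overlap between the two species.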

Data splitting

Break the data into training and test sets, using stratified sampling, and with 2022 as the random number seed.
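A sketch of a stratified split using the rsample package. The proportion of 2/3 is an assumption, chosen because the training confusion matrix in this report contains 145 of the 219 complete observations; the object names are illustrative.

```r
library(palmerpenguins)
library(dplyr)
library(rsample)

p_sub <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>%
  droplevels()

set.seed(2022)
p_split <- initial_split(p_sub, prop = 2/3, strata = species)  # prop is an assumption
p_tr <- training(p_split)
p_ts <- testing(p_split)

# Stratification keeps the species proportions similar in both sets.
prop.table(table(p_tr$species))
prop.table(table(p_ts$species))
```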
Model Summary
term          estimate     std.error   statistic    p.value
(Intercept)   -83.081716   23765.45    -0.0034959   0.9972107
bl            177.556640   48290.72     0.0036768   0.9970663
bd            -48.790915   16882.11    -0.0028901   0.9976940
fl              5.790942   10457.84     0.0005537   0.9995582
bm            -56.756407   23781.26    -0.0023866   0.9980958
Confusion Matrix (Training set)
.pred_class   0     1
0             100   0
1             0     45

Logistic model

Fit a logistic regression model to the data, where Adelie are coded as 0, and Chinstrap are coded as 1. Report the model summary, including parameter estimates, and the overall and species misclassification rates. (Note: There is a warning from the model fitting in R. What does this mean? Will it invalidate the model fit?)

For this analysis, we are interested in classifying Adelie and Chinstrap penguins, which are coded as 0 and 1 respectively.

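The summary shows that 1) none of the variables is statistically significant, since every p-value is far above 0.05 (all are about 0.997) and the standard errors are enormous relative to the estimates. 2) The estimates are \(b_0=-83.08172, ~b_1=177.55664, ~b_2=-48.79091, ~b_3=5.79094, ~b_4=-56.75641\)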

The confusion table shows that the model classifies the training data perfectly, with a misclassification rate of 0. This is attributable to bl, which distinguishes most of the species once combined with bd, fl and bm.

The warning message indicates that the current model fits the training data perfectly. The two species are completely separated by the predictors, so the fitted probabilities are numerically 0 or 1; this is a common problem in logistic regression. Based on the slides - “Logistic regression model fitting fails when the data is perfectly separated.” - we can conclude that the warning does invalidate the model fit: the parameter estimates and their standard errors are not reliable.
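To illustrate what perfect separation does to a fit, here is a small sketch on made-up data (not the penguin data). Under separation glm() usually reports warnings such as "glm.fit: fitted probabilities numerically 0 or 1 occurred"; the exact messages for the penguin fit are not quoted in this report, so treat this only as an analogy.

```r
# Made-up data: every x < 0 is class 0 and every x > 0 is class 1.
toy <- data.frame(x = c(-2, -1.5, -1, 1, 1.5, 2),
                  y = c(0, 0, 0, 1, 1, 1))

# Collect any warnings raised during fitting instead of printing them.
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, data = toy, family = binomial),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

msgs                  # warnings raised by glm(), if any
coef(summary(fit))    # slope and standard error are both very large
table(pred = as.integer(fitted(fit) > 0.5), truth = toy$y)
```

The classification is still perfect, but the slope keeps growing during fitting and its standard error is huge, which is the same pattern as in the model summary above.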

Math equation

Write down the fitted logistic regression model, mathematically. Explain how this would be used for classifying the two species (in 30 words or less).

\[\begin{aligned}&P(species = Chinstrap~|~bl,~bd,~fl,~bm) \\ &=\frac{\exp(-83.08 + 177.56\times bl - 48.79 \times bd + 5.79 \times fl - 56.76 \times bm)}{1 + \exp(-83.08 + 177.56\times bl - 48.79 \times bd + 5.79 \times fl - 56.76 \times bm)} \end{aligned}\]

The model gives the probability that a penguin is Chinstrap (i.e., coded as 1 in this case), given its observed values of the four variables. If this probability is >0.5, we classify the penguin as Chinstrap, otherwise as Adelie.
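The rule can be sketched as a small R function that uses the coefficients copied from the model summary. The function name classify_penguin() and its threshold argument are hypothetical, and the inputs are assumed to be standardised already.

```r
# Coefficients copied from the model summary (standardised predictors).
beta <- c(intercept = -83.081716, bl = 177.556640, bd = -48.790915,
          fl = 5.790942, bm = -56.756407)

# Hypothetical helper: linear predictor, probability, and class label.
classify_penguin <- function(bl, bd, fl, bm, threshold = 0.5) {
  eta <- beta[["intercept"]] + beta[["bl"]] * bl + beta[["bd"]] * bd +
    beta[["fl"]] * fl + beta[["bm"]] * bm
  p <- plogis(eta)  # inverse logit: exp(eta) / (1 + exp(eta))
  data.frame(eta = eta, p_chinstrap = p,
             class = ifelse(p > threshold, "Chinstrap", "Adelie"))
}

# A penguin at the sample mean of every variable (all standardised values 0).
classify_penguin(bl = 0, bd = 0, fl = 0, bm = 0)
```

For this average penguin the linear predictor equals the intercept, so the predicted probability is essentially 0 and the class is Adelie.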

Plot of how well the species are separated

The linear predictor of the logistic regression is a linear combination of the observed variables, so we can use it as the x-axis. As the confusion matrix already showed, the species are separated perfectly, with 0 as the separation point.
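The reason 0 is the separation point follows from the 0.5 probability cut-off. Writing \(\hat\eta\) for the fitted linear combination (the symbol is introduced here for convenience):

\[\begin{aligned}\hat p = \frac{\exp(\hat\eta)}{1 + \exp(\hat\eta)} > \frac{1}{2} ~\iff~ 2\exp(\hat\eta) > 1 + \exp(\hat\eta) ~\iff~ \exp(\hat\eta) > 1 ~\iff~ \hat\eta > 0 \end{aligned}\]

So a penguin is classified as Chinstrap exactly when its value on the linear-combination axis is positive.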

Prediction 1

Predict the class of a penguin with these characteristics: bill_length_mm = 45, bill_depth_mm = 18, flipper_length_mm = 190, body_mass_g = 3750. How confident are we in our prediction?

\[\begin{aligned}&P(species = Chinstrap~|~bl = 0.56,~bd = -0.31,~fl = -0.24,~bm = 0.09) \\ &=\frac{\exp(-83.08 + 177.56\times 0.56 - 48.79 \times (-0.31) + 5.79 \times (-0.24) - 56.76 \times 0.09)}{1 + \exp(-83.08 + 177.56\times 0.56 - 48.79 \times (-0.31) + 5.79 \times (-0.24) - 56.76 \times 0.09)} \\ &\approx 1 \end{aligned}\]

Based on the model, the calculated probability is close to 1, hence we can be very confident in classifying this penguin as “Chinstrap”.
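The arithmetic behind this value can be checked step by step, using the coefficients from the model summary rounded to two decimals and the standardised values 0.56, -0.31, -0.24 and 0.09 (the symbol \(\hat\eta\) for the linear combination is introduced here for convenience):

\[\begin{aligned}\hat\eta &= -83.08 + 177.56(0.56) - 48.79(-0.31) + 5.79(-0.24) - 56.76(0.09) \\ &\approx -83.08 + 99.434 + 15.125 - 1.390 - 5.108 \approx 24.98 \\ P(species = Chinstrap) &= \frac{1}{1 + \exp(-24.98)} \approx 1 - 1.4\times 10^{-11} \end{aligned}\]

The probability is this extreme partly because the coefficients were estimated under perfect separation, so it says more about which side of the boundary the penguin falls on than about calibrated certainty.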

Prediction 2

Predict the test set, both as a proportion and as a class. Plot the predictions on your plot from part d. Using the class predictions report the confusion table and the misclassification error for the test set.

From the plot above, there are 2 misclassifications. The overall misclassification rate is 2.7%, with Adelie’s rate being 0% and Chinstrap’s being 8.7%.

Confusion Matrix (Test set)
.pred_class   0    1
0             51   2
1             0    21
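The rates can be recomputed directly from this table. The sketch below assumes that rows are predicted classes and columns are true classes, which agrees with the later discussion of a Chinstrap predicted as Adelie; cm is an illustrative name.

```r
# Test-set confusion matrix: rows = predicted class, columns = true class.
cm <- matrix(c(51, 0, 2, 21), nrow = 2,
             dimnames = list(pred = c("0", "1"), truth = c("0", "1")))

overall   <- 1 - sum(diag(cm)) / sum(cm)   # 2 of 74 misclassified
per_class <- 1 - diag(cm) / colSums(cm)    # error rate within each true class

round(100 * overall, 1)     # about 2.7
round(100 * per_class, 1)   # 0 for Adelie (0), about 8.7 for Chinstrap (1)
```

Both errors are true Chinstrap penguins predicted as Adelie, so the 8.7% belongs to Chinstrap (2 of 23) and Adelie has no errors.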

We found a peculiar point.

Observation 55 is misclassified by the model. The following explains why it was likely confused:

The 0 line separates the two species: if the predicted value is <0 on the axis of the linear combination, the model classifies the penguin as Adelie, and as Chinstrap otherwise. Observation 55’s predicted value is <0, hence the model sees it as Adelie.

Possible explanations are:

1) This penguin has characteristics similar to an Adelie despite being a Chinstrap; it might be malnourished, or it might naturally have Adelie-like measurements.

2) The model is trained on an imbalanced training set in which Adelie makes up 68.97%. The ambiguous values of Observation 55 make the model’s decision lean towards Adelie, as it cannot learn sufficient patterns from the minority class (i.e., Chinstrap).

Linear combination

Starting with the linear combination provided by the logistic regression model, converted to a projection basis, use a manual tour with the spinifex package to determine which variable(s) might not be important for separating the two classes (code to help you is below).

The two species are best separated at frame 66, where \(bl\) has the longest bar and is therefore the most important variable. Variable \(fl\) is the least important for separating the two classes, with the shortest bar: when sliding through the frames of the manual tour, \(fl\) is always the shortest.
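The relative bar lengths can be cross-checked without the tour by scaling the fitted slopes to unit length, which is one way to turn the linear combination into a 1D projection basis. This is a sketch in base R only; the helper code supplied with the assignment is not reproduced in this report, and the object names are illustrative.

```r
# Slope estimates copied from the model summary. The intercept is dropped
# because it shifts the projection without changing its direction.
b <- c(bl = 177.556640, bd = -48.790915, fl = 5.790942, bm = -56.756407)

# Scale to unit length so the vector can serve as a 1D projection basis.
basis <- matrix(b / sqrt(sum(b^2)), ncol = 1,
                dimnames = list(names(b), "logistic"))

round(basis, 2)
# approximately: bl 0.92, bd -0.25, fl 0.03, bm -0.29
```

The contribution of fl is close to 0, which agrees with it having the shortest bar in the tour, while bl dominates the projection.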